tidycensus and censusapi to get data on places
ipumsr package to get data on people
I am an associate professor in the UTSA Department of Demography and have been at UTSA since 2006.
Research interests include data science, Bayesian methods, education demography and health disparities.
I would recommend MobaXterm for accessing Shamu.
See also the Shamu FAQ, which is very helpful.
With your Shamu access account, you log into Shamu. When you first log in, you’re on the login node. This is not where you want to run RStudio. However, if you want to install packages in RStudio on your Shamu account, this is the best place to do it, since you have access to all of the build tools and compilers.
log in
At the login node, type qlogin and you will be on a computation node.
First, load the R 3.5.1 module,
then the RStudio 1.1.383 module,
then run RStudio!
And you see your old friend:
Decennial Census
American Community Survey
County Business Patterns
Population Estimates Program - SAIPE and SAHIE
The Census Summary File 1 (SF 1) contains the data compiled from the questions asked of all people and about every housing unit.
Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied).
SF 1 includes population and housing characteristics for the total population, population totals for an extensive list of race (American Indian and Alaska Native tribes, Asian, and Native Hawaiian and Other Pacific Islander) and Hispanic or Latino groups, and population and housing characteristics for a limited list of race and Hispanic or Latino groups.
The decennial Census Summary File 1 does not contain information on education, socioeconomic conditions or other detailed characteristics of the population.
Through the 2000 Census, the Census Bureau surveyed 1 out of every 6 households to measure these characteristics. This survey was referred to as the Census “long form” and was tabulated into a product called Summary File 3. The year 2000 was the last year this survey was conducted, and beginning in 2005, the American Community Survey replaced the “long form” as the tool to measure socioeconomic characteristics of the population. (US Census Bureau, 2010)
The American Community Survey (ACS) is part of the U.S. Census Bureau’s Survey Program and is designed to provide current demographic, social, economic, and housing estimates throughout the decade.
The ACS provides information on more than 40 topics, including educational attainment, language spoken at home, ability to speak English, the foreign born, marital status, migration, and many more. Each year the survey randomly samples around 3.5 million addresses and produces statistics that cover 1-year and 5-year periods for geographic areas in the United States and Puerto Rico, ranging from neighborhoods to congressional districts to the entire nation.
The ACS 1-year estimates are published for areas that have populations of 65,000 or more.
The ACS 5-year estimates are published for all geographic areas, including Census tracts, block groups, American Indian areas, core-based statistical areas, combined statistical areas, Congressional districts, and state legislative districts.
The American Community Survey Summary File (ACSSF) is a unique data product that includes all the estimates and margins of error from the detailed tables and geographies that are published for the ACS. Data contained in the ACS Summary File cover demographic, social, economic, and housing subject areas.
These data are aggregate population totals for places, along with the margins of error in those totals. They do not describe individuals, who are the subject of the ACS Public Use Microdata.
The Public Use Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). The PUMS dataset includes variables for nearly every question on the survey, as well as many new variables that were derived after the fact from multiple survey responses (such as poverty status).
Each record in the file represents a single person or, in the household-level dataset, a single housing unit. In the person-level file, individuals are organized into households, making possible the study of people within the contexts of their families and other household members.
PUMS files for an individual year, such as 2015, contain data on approximately one percent of the United States population. PUMS files covering a five-year period, such as 2011-2015, contain data on approximately five percent of the United States population.
The PUMS files are much more flexible than the aggregate data produced in the ACS summary files, though the PUMS also tend to be more complicated to use. Working with PUMS data generally involves downloading large datasets onto a local computer and analyzing the data using statistical software such as R, SPSS, Stata, or SAS.
Since all ACS responses are strictly confidential, many variables in the PUMS files have been modified in order to protect the confidentiality of survey respondents. For instance, particularly high incomes are “top-coded,” uncommon birthplace or ancestry responses are grouped into broader categories, and the PUMS files provide a very limited set of geographic variables, including state, metropolitan area and public use microdata area, or PUMA.
While PUMS files contain cases from nearly every town and county in the country, towns and counties (and other low-level geography) are not identified by any variables in the PUMS datasets. The most detailed unit of geography contained in the PUMS files is the Public Use Microdata Area (PUMA). PUMAs are special non-overlapping areas that partition each state into contiguous geographic units containing no fewer than 100,000 people each. The 2011-2015 5-year ACS PUMS files rely on PUMA boundaries that were drawn by state governments after the 2000 and 2010 Census.
The PUMS data are most easily accessed from the University of Minnesota’s Integrated Public Use Microdata Series (IPUMS) data archive. This data source processes the data produced by the Census Bureau into more easily comparable and readable data files that are available for all years of the decennial Census and the ACS. (Ruggles et al., 2015)
The tidycensus package is designed to work with the tidyverse, and is written and maintained by Dr. Kyle Walker at TCU.
It allows you to dynamically download and map data from the decennial Census and ACS for any level of Census geography, except blocks!
If you want data on places, this is the easiest way to get it.
The Census publishes data for places in summary tables. The table names follow a pattern, and you can find a description of it here. The biggest problem with finding data from the Census is knowing the name of the table you want.
You can find table names for the ACS here
There are several types of tables the Census publishes.
The Detailed tables are very detailed summaries of the data for places. In the 2015 data there were more than 64,000 tables published. These can be a little overwhelming to use, but we’ll see an example below.
Subject tables take some of the detailed tables and compute summaries of them around certain demographic, social or economic subjects. Basically this is one way to get more data related to a subject without having to know all of the individual detail tables you need.
Data Profile tables contain broad social, economic, housing, and demographic information. The data are presented as both counts and percentages. There are over 2,400 variables in this dataset. These are very useful summaries and what I personally rely on for most of my data extracts.
To use these packages you need a Census API key. Obtain one at http://api.census.gov/data/key_signup.html
I recommend you install your API key in your .Renviron file, just so you don’t have to keep pasting it into your code. To do this, type tidycensus::census_api_key(key = "yourkeyhere", install = TRUE) one time to install your key for use in tidycensus.
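A quick way to confirm that the key is visible in a new R session is to look for the environment variable that tidycensus reads. This is only a minimal sketch: it assumes the key is stored under the name CENSUS_API_KEY, and the helper has_census_key() is hypothetical, not part of tidycensus.

```r
# Hypothetical helper: TRUE if a Census API key is visible to this R session.
# Assumes the key is stored in the CENSUS_API_KEY environment variable.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (has_census_key()) {
  message("Census API key found.")
} else {
  message("No Census API key found; run census_api_key() or restart R.")
}
```

If the key was just installed, restarting R (or re-reading the environment file) is usually needed before the check succeeds.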
As I mentioned above, finding the right table can be a challenge, especially for new data users. tidycensus has the load_variables() function that will load all of the available tables for a specific table type.
For example, if we are interested in variables from the ACS data profile tables, we can load all available variables then use R to search for what we need.
One of the best ways to search is to use grep(), which is a tool for searching for patterns within text, and is SUPER USEFUL!
library(tidycensus)
library(dplyr)
library(sf)
library(ggplot2)
library(censusapi)
?load_variables
## starting httpd help server ... done
v15_Profile <- load_variables(year = 2015 , dataset = "acs5/profile",
cache = TRUE) #demographic profile tables
#Open the data for examination
View(v15_Profile)
#Search for variables by keywords in the label
v15_Profile[grep(x = v15_Profile$label, "Median household"), c("name", "label")]
## # A tibble: 2 x 2
## name label
## <chr> <chr>
## 1 DP03_0062E Estimate!!INCOME AND BENEFITS (IN 2015 INFLATION-ADJUSTED DO~
## 2 DP03_0062~ Percent!!INCOME AND BENEFITS (IN 2015 INFLATION-ADJUSTED DOL~
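Note that grep() is case-sensitive by default, so a search can silently miss labels that are capitalized differently. The sketch below shows the difference on a small made-up table that stands in for the output of load_variables(); the variable names and labels are invented for illustration and are not real Census variables.

```r
# Toy stand-in for the table returned by load_variables(); the names and
# labels below are made up for illustration.
vars <- data.frame(
  name  = c("X01_0001", "X01_0002", "X01_0003"),
  label = c("Estimate!!Median household income",
            "Estimate!!MEDIAN AGE",
            "Percent!!Median Household Income"),
  stringsAsFactors = FALSE
)

# Case-sensitive search, as in the example above: matches only the first row
hits_exact <- vars[grep("Median household", vars$label), ]

# Case-insensitive search: also catches "Median Household" in the third row
hits_any <- vars[grepl("median household", vars$label, ignore.case = TRUE), ]

hits_exact$name
hits_any$name
```

Using ignore.case = TRUE is a reasonable default when you are not sure how a label is capitalized in a given year's tables.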
Also, if you want the names and info for the subject tables, change the dataset = argument to acs5/subject
v15_subject <- load_variables(year = 2015 ,dataset= "acs5/subject",
cache = TRUE) #demographic subject tables
Finally, to view variables in the detailed tables, change the dataset = argument to acs5
v15_detailed <- load_variables(year = 2015 , dataset = "acs5",
cache = TRUE) #demographic detail tables
If you are interested in variables from the decennial census, change the dataset = argument to sf1 or sf3 depending on which decennial summary file you want.
Here is a real example
The data profile tables are very useful because they contain lots of pre-calculated variables.
Here is a query where we extract the median household income in census tracts from the 2015 ACS for Bexar County, Texas. We can also get the spatial data by requesting geometry=TRUE. Using output="wide" will put each variable in a column of the data set, with each row being a census tract.
sa_acs<-get_acs(geography = "tract", state="TX", county = c("Bexar"),
year = 2015,
variables=c( "DP03_0062E") ,
geometry = T, output = "wide")
## Getting data from the 2011-2015 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using the ACS Data Profile
#create a county FIPS code - 5 digit
sa_acs$county<-substr(sa_acs$GEOID, 1, 5)
#rename variables and filter missing cases
sa_acs2<-sa_acs%>%
mutate( medhhinc=DP03_0062E) %>%
na.omit()
#take a peek at the first few lines of data
head(sa_acs2)
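The substr() call above works because a census tract GEOID is a fixed-width code: 2 digits for the state, 3 for the county and 6 for the tract. Here is a small sketch of pulling those pieces apart; the GEOID below is an illustrative value, not taken from the data above.

```r
# Illustrative 11-digit tract GEOID: state (2) + county (3) + tract (6)
geoid <- "48029110100"

state_fips  <- substr(geoid, 1, 2)   # state FIPS code (48 is Texas)
county_fips <- substr(geoid, 1, 5)   # state + county FIPS (48029 is Bexar County)
tract_code  <- substr(geoid, 6, 11)  # 6-digit tract code within the county

c(state = state_fips, county = county_fips, tract = tract_code)
```

Keeping GEOIDs as character strings matters here, since converting them to numbers would drop leading zeros for states such as Alabama (01).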
We can immediately map these data as well, because tidycensus can get you the geography corresponding to your data.
Here, I use the dplyr pipe “%>%” to feed the data into ggplot and map the median household income for each census tract in Bexar County in 2015, using a quantile break system.
sa_acs2 %>%
mutate(med_income=cut(medhhinc,breaks = quantile(medhhinc, na.rm=T, p=seq(0,1,length.out = 9)),include.lowest = T))%>%
ggplot( aes(fill = med_income, color = med_income)) +
geom_sf() +
ggtitle("Median Household Income",
subtitle = "Bexar County Texas, 2015 - Quantile Breaks")+
scale_fill_brewer(palette = "Blues") +
scale_color_brewer(palette = "Blues")
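To see what the quantile break step is doing, here is the same cut() and quantile() pattern on a small vector. This is a sketch with made-up income values; it uses 5 break points (4 classes) rather than the 9 break points (8 classes) used in the map above.

```r
# Made-up tract incomes, for illustration only
income <- c(21000, 28000, 35000, 41000, 47000, 52000, 60000, 75000, 98000)

# Quartile break points: 5 break points define 4 classes
brks <- quantile(income, probs = seq(0, 1, length.out = 5), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value inside the first class;
# without it the smallest observation would become NA
income_class <- cut(income, breaks = brks, include.lowest = TRUE)

table(income_class)
```

Quantile breaks put roughly the same number of areas in each class, which tends to give a map with balanced colors even when the variable is skewed.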
The ACS is a survey and in areas where the sample size is small, the errors in estimates can be quite large. As a way to let users know how uncertain the estimates are for any given area, the Census publishes Margins of Error for each estimate published. In states, counties and cities, these margins of error are usually small since the overall sample size in those larger areas is large enough to provide more certainty about the population estimates.
In areas that are smaller, tracts and especially block groups, the margins of error can be very large relative to the estimates. One way to visualize this is to map the coefficient of variation in the estimates, which is: \(CV= \frac{\sigma}{\theta}\), where \(\theta\) is the estimate of interest and \(\sigma\) is its standard error. Because ACS margins of error are published at the 90% confidence level, the standard error is the margin of error divided by 1.645.
Here I generate quantile breaks for the coefficient of variation in census tract income estimates.
sa_acs2 %>%
mutate(cv =(DP03_0062M/1.645)/DP03_0062E)%>%
mutate(cv_map=cut(cv,breaks = quantile(cv, na.rm=T, p=seq(0,1,length.out = 9)),include.lowest = T))%>%
ggplot( aes(fill =cv_map, color = cv_map)) +
geom_sf() +
ggtitle("Coefficient of Variation in Median Household Income",
subtitle = "Bexar County Texas, 2015 - Quantile Breaks")+
scale_fill_viridis_d(option="B")+
scale_color_viridis_d(option="B")
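The coefficient of variation calculation can be checked by hand on a few numbers. The sketch below uses made-up estimates and margins of error, and assumes the margins of error are at the 90% confidence level, as they are for published ACS estimates.

```r
# Made-up estimates and 90% margins of error, for illustration only
estimate <- c(52000, 31000, 78000)
moe_90   <- c(4000, 9000, 6500)

# Dividing a 90% margin of error by 1.645 recovers the standard error
se <- moe_90 / 1.645

# Coefficient of variation: standard error relative to the estimate
cv <- se / estimate

round(100 * cv, 1)  # CV expressed as a percentage
```

In this toy example the second area has the lowest estimate and the largest margin of error, so it has by far the largest CV.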
Here is another example where we get data for metro/micropolitan areas in the US. I use a detailed table request this time.
v15_acs<- load_variables(year = 2015 , dataset = "acs5",
cache = TRUE) #ACS detailed tables
View(v15_acs)
#Search for variables by keywords in the label
v15_acs[grep(x = v15_acs$label, "Median household"), c("name", "label")]
metro<-get_acs(geography = "metropolitan statistical area/micropolitan statistical area",
year = 2015,
variables=c( "B19013_001E") ,
geometry = F, output = "wide")
## Getting data from the 2011-2015 5-year ACS
For this geography, tidycensus won’t download the geometries automatically (maybe in a later release), so we use Kyle Walker’s other library, tigris, for downloading Census geographic data.
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
##
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
##
## plot
options(tigris_class = "sf") #for use with ggplot2
met_geo<-core_based_statistical_areas(cb=T, year = 2015)
#Filter out territories
sts<-states(cb = T, year = 2015)%>%
filter(!STATEFP%in%c( "60", "66", "69", "72", "78"))
#merge geographic data to table data
met_join<-geo_join(met_geo, metro, by="GEOID")
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform
## for that
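The join step matches rows of the geographic layer to rows of the data table on a shared key. The sketch below shows the same idea with base R merge() on plain data frames. It assumes left-join behavior (every geographic row is kept), and all names and values are made up for illustration; it is not how geo_join() is implemented.

```r
# Toy stand-ins: 'geo' plays the role of the spatial layer,
# 'tab' plays the role of the ACS table. All values are made up.
geo <- data.frame(GEOID = c("12420", "26420", "41700"),
                  NAME  = c("Area A", "Area B", "Area C"),
                  stringsAsFactors = FALSE)
tab <- data.frame(GEOID = c("41700", "12420"),
                  medhhinc = c(50000, 60000),
                  stringsAsFactors = FALSE)

# Left join: keep every row of 'geo'; rows with no match in 'tab' get NA
joined <- merge(geo, tab, by = "GEOID", all.x = TRUE)
joined
```

The key column should have the same type in both tables (character here), otherwise rows may fail to match.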
#rename variables and filter missing cases
met_join<-met_join%>%
mutate( medhhinc=B19013_001E) %>%
mutate(med_income=cut(medhhinc,breaks = quantile(medhhinc, na.rm=T, p=seq(0,1,length.out = 9)),include.lowest = T))
##Create our map
map1<-met_join%>%
st_transform(crs = 102740)%>%
ggplot( aes(fill = med_income, color = med_income)) +
geom_sf() +
ggtitle("Median Household Income",
subtitle = "MSAs, 2015 - Quantile Breaks")+
scale_fill_brewer(palette = "Blues") +
scale_color_brewer(palette = "Blues")+
geom_sf(data=sts, fill=NA, color="black")
map1
If you would like to zoom in, we can use the mapview library. This is very useful for teaching and for presentations.
library(mapview)
pal <- colorRampPalette(viridisLite::viridis(n=6)) #set colors
mapview(met_join, zcol="med_income", col.regions=pal, legend=T,map.types="OpenStreetMap", layer.name="Median Income")